Abstract: Student's t statistic is finding applications today that were never
envisaged when it was introduced more than a century ago. Many of these applications
rely on properties, for example robustness against heavy tailed sampling distributions,
that were not explicitly considered until relatively recently. In this paper we explore
these features of the t statistic in the context of its application to very high
dimensional problems, including feature selection and ranking, highly multiple hypothesis
testing, and sparse, high dimensional signal detection. Robustness properties of the
t-ratio are highlighted, and it is established that those properties are preserved under
applications of the bootstrap. In particular, bootstrap methods correct for skewness,
and therefore lead to second-order accuracy, even in the extreme tails. Indeed, it is
shown that the bootstrap, and also the more popular but less accurate t-distribution
and normal approximations, are more effective in the tails than towards the middle
of the distribution. These properties motivate new methods, for example
bootstrap-based techniques for signal detection, that confine attention to the
significant tail of a statistic.

1 Introduction



Modern high-throughput devices generate data in abundance. Gene microarrays com- 
prise an iconic example; there, each subject is automatically measured on thousands 
or tens of thousands of standard features. What has not changed, however, is the 
difficulty of recruiting new subjects, with the number of the latter remaining in the 
tens or low hundreds. This is the context of so-called "p ≫ n problems," where p
denotes the number of features, or the dimension, and n is the number of subjects, 
or the sample size. 

For each feature the measurements across different subjects comprise samples from 
potentially different underlying distributions, and can have quite different scales and 
be highly skewed and heavy tailed. In order to standardise for scale, a conventional 
approach today is to use t-statistics, which, by virtue of the central limit theorem,
are approximately normally distributed when n is large. W. S. Gosset, when he
introduced the Studentised t-statistic more than a century ago (Student, 1908), saw
that quantity as having principally the virtue of scale invariance. In more recent 
times, however, other noteworthy advantages of Studentising have been discovered. 
In particular, the t statistic's high degree of robustness against heavy-tailed data 
has been quantified. For example, Giné, Götze and Mason (1997) have shown that a
necessary and sufficient condition for the Studentised mean to have a limiting standard 
normal distribution is that the sampled distribution lie in the domain of attraction 
of the normal law. This condition does not require the sampled data to have finite 
variance. Moreover, the rate of convergence of the Studentised mean to normality 
is strictly faster than that for the conventional mean, normalised by its theoretical 
(rather than empirical) standard deviation, in cases where the second moment is only 
just finite (Hall and Wang, 2004). Contrary to the case of the conventional mean, 
its Studentised form admits accurate large deviation approximations in heavy-tailed 
cases where the sampling distribution has only a small number of finite moments 
(Shao, 1999). 

All these properties are direct consequences of the advantages conferred by dividing
the sample mean, $\bar X$, by the sample standard deviation, $S$. Erratic fluctuations
in $\bar X$ tend to be cancelled, or at least dampened, by those of $S$, much more so
than if $S$ were replaced by the true standard deviation of the population from which
the data were drawn.

The robustness of the t-statistic is particularly useful in high dimensional data
analysis, where the signal of interest is frequently found to be sparse. For any given 
problem (e.g. classification, prediction, multiple testing), only a small fraction of 
the automatically measured features are relevant. However the locations of the useful 
features are unknown, and we must separate them empirically from an overwhelmingly 
large number of more useless ones. Sparsity gives rise to a shift of interest away from 
problems involving vectors of conventional size to those involving high dimensional 
data. 

As a result, a careful study of moderate and large deviations of the Studentised 
ratio is indispensable to understanding even common procedures for analysing high 
dimensional data, such as ranking methods based on t-statistics, or their applications
to highly multiple hypothesis testing. See, for example, Benjamini and Hochberg 
(1995), Pigeot (2000), Finner and Roters (2002), Kesselman et al. (2002), Dudoit 
et al. (2003), Bernhard et al. (2004), Genovese and Wasserman (2004), Lehmann et
al. (2005), Donoho and Jin (2006), Sarkar (2006), Jin and Cai (2007), Wu (2008), Cai 
and Jin (2010) and Kulinskaya (2009). The same issues arise in the case of methods for 
signal detection, for example those based on Student's t versions of higher criticism; 
see Donoho and Jin (2004), Jin (2007) and Delaigle and Hall (2009). Work in the 
context of multiple hypothesis testing includes that of Lang and Secic (1997, p. 63), 
Tamhane and Dunnett (1999), Takada et al. (2001), David et al. (2005), Fan et 
al. (2007) and Clarke and Hall (2009). 

In the present paper we explore moderate and large deviations of the Studentised 
ratio in a variety of high dimensional settings. Our results reveal several advantages 
of Studentising. We show that the bootstrap can be particularly effective in relieving 
skewness in the extreme tails. Attractive properties of the bootstrap for multiple 
hypothesis testing were apparently first noted by Hall (1990), although in the case of 
the mean rather than its Studentised form. 

Section 2.1 draws together several known results in the literature in order to
demonstrate the robustness of the t ratio in the context of high level exceedences.
Sections 2.2 and 2.3 show that, even for extreme values of the t ratio, the bootstrap
captures particularly well the influence that departure from normality has on tail
probabilities. We treat cases where the probability of exceedence is either
polynomially or exponentially small. Section 2.4 shows how these properties can be
applied to high dimensional problems, involving potential exceedences of high levels
by many different feature components. One example of this type is the use of t-ratios
to implement higher criticism methods, including their application to classification
problems. This type of methodology is taken up in section 3. The conclusions drawn in
sections 2 and 3 are illustrated numerically in section 4, the underpinning theoretical
arguments are summarised in section 5, and detailed arguments are given by Delaigle
et al. (2010).

2 Main conclusions and theoretical properties 

2.1 Advantages and drawbacks of Studentising in the normal
approximation

Let $X_1, X_2, \ldots$ denote independent univariate random variables all distributed
as $X$, with unit variance and zero mean, and suppose we want to test $H_0 : \mu = 0$
against $H_1 : \mu > 0$. Two common test statistics for this problem are the
standardised mean $Z_0$ and the Studentised mean $T_0$, defined by
$Z_0 = n^{1/2} \bar X$ and $T_0 = Z_0/S$ where

$$ \bar X = \frac{1}{n} \sum_{i=1}^{n} X_i , \qquad S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar X)^2 \tag{2.1} $$

denote the sample mean and sample variance, respectively, computed from the dataset
$X_1, \ldots, X_n$.

In practice, experience with the context often suggests the standardisation that
defines $Z_0$. Although both $Z_0$ and $T_0$ are asymptotically normally distributed,
dividing by the sample standard deviation introduces a degree of extra noise which
can make itself felt in terms of greater impact of skewness. However, we shall show
that, compared to the normal approximation to the distribution of $Z_0$, the normal
approximation to the distribution of $T_0$ is valid under much less restrictive
conditions on the tails of the distribution of $X$.

These properties will be established by exploring the relative accuracies of normal
approximations to the probabilities $P(Z_0 > x)$ and $P(T_0 > x)$, as $x$ increases, and
the conditions for validity of those approximations. This approach reflects important
applications in problems such as multiple hypothesis testing, and classification or
ranking involving high dimensional data, since there it is necessary to assess the
relevance, or statistical significance, of large values of sample means.
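
The two statistics can be computed directly from a sample. The following is a minimal sketch, not part of the original development; the function name is hypothetical, and the variance is assumed to use divisor $n$, as in the empirical variance of section 2.4. It also illustrates the scale invariance that Studentising confers: rescaling the data changes the standardised mean but not the Studentised mean.

```python
import math

def standardised_and_studentised(xs):
    """Return (Z0, T0) for testing mu = 0 from the sample xs.

    Assumption: the sample variance uses divisor n rather than n - 1.
    """
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    z0 = math.sqrt(n) * xbar          # standardised mean (unit variance assumed)
    t0 = z0 / math.sqrt(s2)           # Studentised mean
    return z0, t0

data = [1.0, -1.0, 2.0, 0.0]
z0, t0 = standardised_and_studentised(data)

# Rescaling the data multiplies Z0 by the scale factor but leaves T0 unchanged.
z0_scaled, t0_scaled = standardised_and_studentised([10.0 * x for x in data])
```

In this toy example the unit-variance assumption behind the standardised mean is violated after rescaling, which is exactly the situation in which only the Studentised mean remains interpretable.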

We start by showing that the normal approximation is substantially more robust
for $T_0$ than it is for $Z_0$. To derive the results, note that if

$$ E|X|^3 < \infty \tag{2.2} $$

then the normal approximation to the probability $P(T_0 > x)$ is accurate, in relative
terms, for $x$ almost as large as $n^{1/6}$. In particular,
$P(T_0 > x)/\{1 - \Phi(x)\} \to 1$ as $n \to \infty$, uniformly in values of $x$ that
satisfy $0 < x < \epsilon n^{1/6}$, for any positive sequence $\epsilon$ that converges
to zero (Shao, 1999). This level of accuracy applies also to the normal approximation
to the distribution of the non-Studentised mean, $\bar X$, except that we must impose
a condition much more severe than (2.2). In particular,
$P(Z_0 > x)/\{1 - \Phi(x)\} \to 1$, uniformly in $0 < x < n^{(1/6)-\eta}$, for each
fixed $\eta > 0$, if and only if

$$ E\{\exp(|X|^c)\} < \infty \quad \text{for all } c \in (0, \tfrac12) ; \tag{2.3} $$

see Linnik (1961). Condition (2.3), which requires exponentially light tails and implies
that all moments of $X$ are finite, is much more severe than (2.2).

Although dividing by the sample standard deviation confers robustness, it also
introduces a degree of extra noise. To quantify deleterious effects of Studentising we
note that

$$ P(T_0 > x) = \{1 - \Phi(x)\} \{1 - n^{-1/2} \tfrac13 x^3 \gamma + o(n^{-1/2} x^3)\} , \tag{2.4} $$
$$ P(Z_0 > x) = \{1 - \Phi(x)\} \{1 + n^{-1/2} \tfrac16 x^3 \gamma + o(n^{-1/2} x^3)\} , \tag{2.5} $$

uniformly in $x$ satisfying $\lambda_n < x < n^{1/6}/\lambda_n$, for a sequence
$\lambda_n \to \infty$, and where $\Phi$ is the standard normal distribution function
and $\gamma = E(X^3)$ (Shao, 1999; Petrov, 1975, Chap. 8). (Property (2.2) is sufficient
for (2.4) if $x \to \infty$ and $n^{-1/2} x^3 \to 0$ as $n \to \infty$, and (2.5) holds,
for the same range of values of $x$, provided that, for some $u > 0$,
$E\{\exp(u |X|)\} < \infty$.) Thus it can be seen that, if $\gamma \ne 0$ and
$n^{-1/2} x^3$ is small, the relative error of the normal approximation to the
distribution of $T_0$ is approximately twice that of the approximation to the
distribution of $Z_0$.
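
The factor of two can be checked numerically. The sketch below evaluates only the leading relative-error terms, under the assumption that their coefficients are $-\tfrac13$ for the Studentised mean and $+\tfrac16$ for the standardised mean, as in (2.4) and (2.5); the function name and the chosen values of $n$, $x$ and $\gamma$ are illustrative.

```python
import math

def leading_relative_errors(n, x, gamma):
    """Leading skewness terms in the relative error of the normal approximation.

    Returns (term for T0, term for Z0), assuming coefficients -1/3 and +1/6.
    The o(n^{-1/2} x^3) remainders are ignored.
    """
    u = x ** 3 * gamma / math.sqrt(n)
    return -u / 3.0, u / 6.0

# Example: n = 400 observations, threshold x = 2, skewness gamma = 1.
t_term, z_term = leading_relative_errors(n=400, x=2.0, gamma=1.0)
ratio = abs(t_term) / abs(z_term)   # the Studentised error is twice as large
```

The two terms also have opposite signs: for positive skewness the normal approximation overstates the upper tail of the Studentised mean and understates that of the standardised mean.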

Of course, Student's t distribution with $n$ or $n - 1$ degrees of freedom is identical
to the distribution of $T_0$ when $X$ is normal $N(0, \sigma^2)$, and therefore relates
to the case of zero skewness. Taking $\gamma = 0$ in (2.4) we see that, when $T_0$ has
Student's t distribution with $n$ or $n - 1$ degrees of freedom, we have
$P(T_0 > x) = \{1 - \Phi(x)\} \{1 + o(n^{-1/2} x^3)\}$.
It can be deduced that the results derived in (2.4) and (2.5) continue to hold if
we replace the role of the normal distribution by that of Student's t distribution
with $n$ or $n - 1$ degrees of freedom. Similarly, the results on robustness hold if we
replace the role of the normal distribution by that of Student's t distribution. Thus,
approximating the distributions of $T_0$ and $Z_0$ by that of a Student's t
distribution, as is sometimes done in practice, instead of that of a normal
distribution, does not alter our conclusions. In particular, even if we use the
Student's t distribution, $T_0$ is still more robust against heavy tailedness than
$Z_0$, and in cases where the Student approximation is valid, this approximation is
slightly more accurate for $Z_0$ than it is for $T_0$.

2.2 Correcting skewness using the bootstrap 

The arguments in section 2.1 show clearly that $T_0$ is considerably more robust than
$Z_0$ against heavy-tailed distributions, arguably making $T_0$ the test statistic of
choice even if the population variance is known. However, as also shown in section 2.1,
this added robustness comes at the expense of a slight loss of accuracy in the
approximation. For example, in (2.4) and (2.5) the main errors that arise in normal
(or Student's t) approximations to the distributions of $T_0$ are the result of
uncorrected skewness. In the present section we show that if we instead approximate
the distribution of $T_0$ using the bootstrap then those errors can be quite
successfully removed. Similar arguments can be employed to show that a bootstrap
approximation to the distribution of $Z_0$ is less affected by skewness than a normal
approximation. However, as for the normal approximation, the latter bootstrap
approximation is only valid if the distribution of $X$ is very light tailed. Therefore,
even if we use the bootstrap approximation, $T_0$ remains the statistic of choice.

Let $\mathcal{X}^* = \{X_1^*, \ldots, X_n^*\}$ denote a resample drawn by sampling
randomly, with replacement, from $\mathcal{X} = \{X_1, \ldots, X_n\}$, and put

$$ \bar X^* = \frac{1}{n} \sum_{i=1}^{n} X_i^* , \qquad S^{*2} = \frac{1}{n} \sum_{i=1}^{n} (X_i^* - \bar X^*)^2 , \qquad T_0^* = n^{1/2} (\bar X^* - \bar X)/S^* . \tag{2.6} $$

The bootstrap approximation to the distribution function $G(t) = P(T_0 \le t)$ is
$\hat G(t) = P(T_0^* \le t \mid \mathcal{X})$, and the bootstrap approximation to the
quantile $t_\alpha = (1 - G)^{-1}(\alpha)$ is

$$ \hat t_\alpha = (1 - \hat G)^{-1}(\alpha) . \tag{2.7} $$

Theorem 1, below, addresses the effectiveness of these approximations for large values
of $x$.
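
In practice $\hat G$ is itself approximated by Monte Carlo resampling. The following is a minimal sketch of that step, assuming divisor-$n$ variances; the function names, the number of resamples `n_boot` and the seed are hypothetical choices made for illustration, and the empirical quantile used here is only one of several reasonable conventions.

```python
import math
import random

def t_stat(xs, centre=0.0):
    """Studentised mean n^{1/2} (xbar - centre) / S, with divisor-n variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    return math.sqrt(n) * (xbar - centre) / s

def bootstrap_upper_quantile(xs, alpha, n_boot=2000, seed=0):
    """Monte Carlo approximation to the bootstrap (1 - alpha) quantile of T0*."""
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    stats = []
    for _ in range(n_boot):
        resample = [xs[rng.randrange(n)] for _ in range(n)]
        if max(resample) == min(resample):
            continue  # skip degenerate resamples, for which S* = 0
        # Resampled statistics are centred at the sample mean, as in T0*.
        stats.append(t_stat(resample, centre=xbar))
    stats.sort()
    k = math.ceil((1.0 - alpha) * len(stats)) - 1
    return stats[min(max(k, 0), len(stats) - 1)]

# Example with a skewed sample (centred exponential data).
rng = random.Random(1)
sample = [rng.expovariate(1.0) - 1.0 for _ in range(30)]
q05 = bootstrap_upper_quantile(sample, alpha=0.05)
```

When $\alpha$ is of order $p^{-1}$ with $p$ large, the number of resamples must be much larger than $1/\alpha$ for the Monte Carlo quantile to be meaningful; the value 2000 above is suitable only for moderate $\alpha$.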

As usual in hypothesis testing problems, to calculate the level of the test we take
a generic variable that has the distribution of the test statistic and we calculate the
probability that the generic variable is larger than the estimated $1 - \alpha$
quantile. This generic variable is independent of the sample, and since the quantile
$\hat t_\alpha$ of the bootstrap test is random and constructed from the sample then,
to avoid confusion, we should arguably use different notations for $T_0$ and the
generic variable. However, to simplify notation we keep using $T_0$ for a generic
random variable distributed like $T_0$. This means that we write the level of the test
as $P(T_0 > \hat t_\alpha)$, but here $T_0$ denotes a generic random variable
independent of the sample, whereas $\hat t_\alpha$ denotes the random variable defined
at (2.7) and calculated from the sample. In particular, here $T_0$ is independent of
$\hat t_\alpha$.

Define $z_\alpha = (1 - \Phi)^{-1}(\alpha)$, and write $P_F$ for the probability
measure when $\mathcal{X}$ is drawn from the population with distribution function $F$.
Here we highlight the dependence of the probabilities on $F$ because we shall use the
results in subsequent sections where a clear distinction of the distribution will be
required.

Theorem 1. For each $B > 1$ and $D_1 > 0$ there exists $D_2 > 2$, increasing no faster
than linearly in $D_1$ as the latter increases, such that

$$ P_F(T_0 > \hat t_\alpha) = \alpha \big[ 1 + O\{ (1 + z_\alpha)\, n^{-1/2} + (1 + z_\alpha)^4\, n^{-1} \} \big] + O(n^{-D_1}) \tag{2.8} $$

as $n \to \infty$, uniformly in all distributions $F$ of the random variable $X$ such
that $E(|X|^{D_2}) \le B$, $E(X) = 0$ and $E(X^2) = 1$, and in all $\alpha$ satisfying
$0 \le z_\alpha \le B n^{1/4}$.

The assumption in Theorem 1 that $E(X^2) = 1$ serves to determine scale, without
which the additional condition $E(|X|^{D_2}) \le B$ would not be meaningful for the very
large class of functions considered in the theorem. The theorem can be deduced
by taking $c = 0$ in Theorem B in section 5.1, and shows that using the bootstrap to
approximate the distribution of $T_0$ removes the main effects of skewness. To
appreciate why, note that if we were to use the normal approximation to the
distribution of $T_0$ we would obtain, instead of (2.8), the following result, which
can be deduced from Theorem A in section 5.1 for each $B > 1$ such that
$E(|X|^4) \le B$ and $0 \le z_\alpha \le B n^{1/4}$:

$$ P_F(T_0 > z_\alpha) = \alpha \exp\big( - n^{-1/2} \tfrac13 z_\alpha^3 \gamma \big) \big[ 1 + O\{ (1 + z_\alpha)\, n^{-1/2} + (1 + z_\alpha)^4\, n^{-1} \} \big] . \tag{2.9} $$

Comparing (2.8) and (2.9) we see that the bootstrap approximation has removed the 
skewness term that describes first-order inaccuracies of the standard normal approx- 
imation. 

The size of the $O(n^{-D_1})$ remainder in (2.8) is important if we wish to use the
bootstrap approximation in the context of detecting $p$ weak signals, or of hypothesis
testing for a given level of family-wise error rate or false discovery rate among $p$
populations or features. (Here and below it is convenient to take $p$ to be a function
of $n$, which we treat as the main asymptotic parameter.) In all these cases we
generally wish to take $\alpha$ of size $p^{-1}$, in the sense that $p\alpha$ is bounded
away from zero and infinity as $n \to \infty$. This property entails
$z_\alpha = O\{(\log p)^{1/2}\}$, and therefore Theorem 1 implies that the tail
condition $E(|X|^{D_2}) < \infty$, for some $D_2 > 0$, is sufficient for it to be
true that "$P_F(T_0 > \hat t_\alpha)/\alpha = 1 + o(1)$ for $p = o(n^{D_1})$ and
uniformly in the class of distributions $F$ of $X$ for which $E(X) = 0$, $E(X^2) = 1$
and $E(|X|^{D_2}) < \infty$."
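
The growth of $z_\alpha$ with $p$ is slow, which is what makes the range $0 \le z_\alpha \le B n^{1/4}$ wide enough for very large $p$. The sketch below tabulates $z_\alpha$ for $\alpha = 1/p$ and compares it with $(2 \log p)^{1/2}$, using the elementary bound $1 - \Phi(z) \le \exp(-z^2/2)$ for $z \ge 0$; the function names and the chosen dimensions are illustrative assumptions.

```python
import math
from statistics import NormalDist

def upper_normal_quantile(alpha):
    """z_alpha solving 1 - Phi(z_alpha) = alpha (computed via the lower tail)."""
    return -NormalDist().inv_cdf(alpha)

def quantile_table(dimensions):
    """Rows (p, z_alpha, sqrt(2 log p)) with alpha = 1/p."""
    rows = []
    for p in dimensions:
        z = upper_normal_quantile(1.0 / p)
        rows.append((p, z, math.sqrt(2.0 * math.log(p))))
    return rows

table = quantile_table([10**2, 10**4, 10**6])
```

Even for a million features the relevant threshold is below 5, so a sample size in the hundreds already places $z_\alpha$ comfortably inside a range of order $n^{1/4}$ for moderate $B$.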

On the other hand, if, as in Fan and Lv (2008), $p$ is exponentially large as a
function of $n$, then we require a finite exponential moment of $X$. The following
theorem addresses this case. In the theorem, $D_2 < 2$ unless $D_1 = \tfrac12$, in
which case $D_2 = 2$. The proof of the theorem is given in section 5.2.

Theorem 2. For each $B > 1$ and $D_1 \in (0, \tfrac12]$ there exists
$D_2 \in (0, 2]$, increasing no faster than linearly in $D_1$ as the latter increases,
such that

$$ P_F(T_0 > \hat t_\alpha) = \alpha \big[ 1 + O\{ (1 + z_\alpha)\, n^{-1/2} + (1 + z_\alpha)^4\, n^{-1} \} \big] + O\{ \exp( - n^{D_1}) \} \tag{2.10} $$

as $n \to \infty$, uniformly in all distributions $F$ of the random variable $X$ such
that $P(|X| > x) \le C \exp(-x^{D_2})$ (where $C > 0$), $E(X) = 0$ and $E(X^2) = 1$,
and in all $\alpha$ satisfying $0 \le z_\alpha \le B n^{1/4}$.

Theorem 2 allows us to repeat all the remarks made in connection with Theorem 1
but in the case where $p$ is exponentially large as a function of $n$. Of course, we
need to assume that exponential moments of $X$ are finite, but in return we can control a
variety of statistical methodologies, such as sparse signal recovery or false discovery 
rate, for an exponentially large number of potential signals or tests. Distributions with 
finite exponential moments include exponential families and distributions of variables 
supported on a compact domain. Note that our condition is still less restrictive than 
assuming that the distribution is normal, as is done in many papers treating high 
dimensional problems, such as for example Fan and Lv (2008). 

2.3 Effect of a nonzero mean on the properties discussed in 
section 2.2 

We have shown that, in a variety of problems, when making inference on a mean it
is preferable to use the Studentised mean rather than the standardised mean. We
have also shown that, when the skewness of the distribution of $X$ is nonzero, the
level of the test based on the Studentised mean is better approximated when using
the bootstrap than when using a normal distribution. Our next task is to check that,
when $H_0 : \mu = 0$ is not true, the probability of rejecting $H_0$ is not much
affected by the bootstrap approximation. Our development is notationally simpler if we
continue to assume that $E(X) = 0$ and $\mathrm{var}(X) = 1$, and consider the test
$H_0 : \mu = -c\, n^{-1/2}$ with $c > 0$ a scalar that potentially depends on $n$ but
which does not converge to zero. We define

$$ Z_c = n^{1/2} (\bar X + c\, n^{-1/2}) , \qquad T_c = Z_c/S . \tag{2.11} $$

Here we take $\mu$ of magnitude $n^{-1/2}$ because this represents the limiting case
where inference is possible. Indeed, a population with mean of order $o(n^{-1/2})$
could not be distinguished from a population with mean zero. Thus we treat the
statistically most challenging problem.

Our aim is to show that the probability $P_F(T_c > \hat t_\alpha)$ is well approximated
by $P_F(T_c > t_\alpha)$, where $c > 0$ and $\hat t_\alpha$ is given by (2.7), and when
$T_c$ and $\hat t_\alpha$ are computed from independent data. We claim that in this
setting the results discussed in section 2.2 continue to hold. In particular, versions
of (2.8) and (2.10) in the present setting are:

$$ P_F(T_c > \hat t_\alpha) = P_F(T_c > t_\alpha) \big[ 1 + O\{ (1 + z_\alpha)\, n^{-1/2} + (1 + z_\alpha)^4\, n^{-1} \} \big] + R , \tag{2.12} $$

where $\gamma = E(X^3)$ denotes skewness and the remainder term $R$ has either the form
in (2.8) or that in (2.10), depending on whether we assume existence of polynomial or
exponential moments, respectively. In particular, if we take $R = O(n^{-D_1})$ then
(2.12) holds uniformly in all distributions $F$ of the random variable $X$ such that
$E(|X|^{D_2}) \le B$, $E(X) = 0$ and $E(X^2) = 1$, and in all $\alpha$ satisfying
$0 \le z_\alpha \le B n^{1/4}$, provided that $D_2$ is sufficiently large; and in the
same sense, but with $R = O\{\exp(-n^{D_1})\}$ where $D_1 \in (0, \tfrac12]$, (2.12)
holds if we replace the assumption $E(|X|^{D_2}) \le B$ by
$P(|X| > x) \le C \exp(-x^{D_2})$, provided that $D_2 \in (0, 2]$ is sufficiently
large. (We require $D_2 = 2$ only if $D_1 = \tfrac12$.) Result (2.12) is derived in
section 5.3. Hence to first order, the probability of rejecting $H_0$ when $H_0$ is not
true is not affected by the bootstrap approximation. In particular, to first order,
skewness does not affect the approximation any more than it would if $H_0$ were true
(compare with (2.8) and (2.10)).

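An alternative form of (2.12), which is useful in applications (e.g. in section 3), is
to express the right hand side there more explicitly in terms of $\alpha$. This can be
done if we note that, in view of Theorem A in section 5.1,

$$ \begin{aligned}
P_F(T_c > t_\alpha)
&= \{1 - \Phi(t_\alpha)\} \exp\big\{ - n^{-1/2} \tfrac16 (2 t_\alpha^3 - 3 c\, t_\alpha^2 + c^3)\, \gamma \big\} \, \frac{1 - \Phi(t_\alpha - c)}{1 - \Phi(t_\alpha)} \\
&\qquad \times \big[ 1 + \theta(c, n, t_\alpha) \{ (1 + t_\alpha)\, n^{-1/2} + (1 + t_\alpha)^4\, n^{-1} \} \big] \\
&= \alpha \exp\big\{ n^{-1/2} \tfrac16 c\, (3 t_\alpha^2 - c^2)\, \gamma \big\} \, \frac{1 - \Phi(t_\alpha - c)}{1 - \Phi(t_\alpha)} \\
&\qquad \times \big[ 1 + \theta_1(c, n, t_\alpha) \{ (1 + t_\alpha)\, n^{-1/2} + (1 + t_\alpha)^4\, n^{-1} \} \big] ,
\end{aligned} \tag{2.13} $$

where $\theta_1$ has the same interpretation as $\theta$ in Theorem A, and the last
identity follows from the definition of $t_\alpha$. Combining this property with (2.12)
it can be shown that

$$ \begin{aligned}
P_F(T_c > \hat t_\alpha)
&= \alpha \exp\big\{ n^{-1/2} \tfrac16 c\, (3 t_\alpha^2 - c^2)\, \gamma \big\} \, \frac{1 - \Phi(t_\alpha - c)}{1 - \Phi(t_\alpha)} \\
&\qquad \times \big[ 1 + O\{ (1 + z_\alpha)\, n^{-1/2} + (1 + z_\alpha)^4\, n^{-1} \} \big] + R ,
\end{aligned} \tag{2.14} $$

where $R$ satisfies the properties given below (2.12).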

2.4 Relationships among many events $T_c > \hat t_\alpha$

So far we have treated only an individual event (i.e. a single univariate test),
exploring its likelihood. However, since our results for a single event apply uniformly
over many choices of the distribution of $X$ then we can develop properties in the
context of many events, and thus for simultaneous tests. The simplest case is that
where the values of $T_c$ are independent; that is, we observe $T^{(j)}_{c^{(j)}}$ for
$1 \le j \le p$, where $c^{(1)}, \ldots, c^{(p)}$ are constants and the random variables
$T^{(j)}_{c^{(j)}}$ are, for different values $j$, computed from independent datasets.
We assume that $T^{(j)}_{c^{(j)}}$ is defined as at (2.11) but with $c = c^{(j)}$.
We could take the values of $n = n_j$ to depend on $j$, and in fact the theoretical
discussion below remains valid provided that $C_1 n \le n_j \le C_2 n$, for positive
constants $C_1$ and $C_2$, as $n$ increases. (Recall that $n$ is the main asymptotic
parameter, and $p$ is interpreted as a function of $n$.) As in the case of a single
event, treated in Theorems 1 and 2, it is important that the t-statistic
$T^{(j)}_{c^{(j)}}$ and the corresponding quantile estimator $\hat t^{(j)}_\alpha$ be
independent for each $j$. However, as noted in section 2.2, this is not a problem since
$T^{(j)}_{c^{(j)}}$ represents a generic random variable, and only
$\hat t^{(j)}_\alpha$ is calculated from the sample.

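Under the assumption that the variables $T^{(j)}_{c^{(j)}}$ and $\hat t^{(j)}_\alpha$,
for $1 \le j \le p$, are totally independent we can deduce from (2.12) that, uniformly
in $z_\alpha$ satisfying $0 \le z_\alpha \le B n^{1/4}$,

$$ P\big( T^{(j)}_{c^{(j)}} \bowtie_j \hat t^{(j)}_\alpha \ \text{for } 1 \le j \le p \big) = \prod_{j=1}^{p} \psi_j(c^{(j)}) , $$

where, for each $j$, $\bowtie_j$ denotes either $>$ or $\le$, and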

V^.(c) = X,(c) ^ P(T^l c^, z^) exp {n-'/' \ c (3 zl - c^) -,^^^] 

1 + 0{{1 + z^) n-^'^ + (1 + Zo,f n-^]\ + (2.15) 



X 



if $\bowtie_j$ represents $>$, $\psi_j(c) = 1 - \chi_j(c)$ otherwise, $\gamma^{(j)}$
denotes the skewness of the $j$th population, and the remainder terms $R^{(j)}$ have the
properties ascribed to $R$ in section 2.3.

It is often unnecessary to assume, as above, that the quantile estimators
$\hat t^{(j)}_\alpha$ are independent of one another. To indicate why, we note that the
method for deriving expansions such as (2.8), (2.10) and (2.12) involves computing
$P(T_c > \hat t_\alpha)$ by first calculating the conditional probability
$P(T_c > \hat t_\alpha \mid \hat t_\alpha)$, where the independence of $T_c$ and
$\hat t_\alpha$ is used. Versions of this argument can be given for the case of
short-range dependence among many different values of $\hat t^{(j)}_\alpha$, for
$1 \le j \le p$. However, a simpler approach can be developed, giving a larger but
still asymptotically negligible bound to the remainder term $O\{\ldots\}$ on the
right-hand side of (2.15); for brevity we do not give details here.

Cases where the statistics $T^{(j)}_{c^{(j)}}$ are computed from weakly dependent data
can be addressed using results of Hall and Wang (2010). That work treats instances
where the variables $T^{(j)}_{c^{(j)}}$ are computed from the first $n$ components in
respective data streams $\mathcal{S}_j = (X_{j1}, X_{j2}, \ldots)$, with
$X_{j1}, X_{j2}, \ldots$ being independent and identically distributed but correlated
between streams. As in the discussion above, since we are treating t-statistics then it
can be assumed without loss of generality that the variables in each data stream have
unit variance. (This condition serves only to standardise scale, and in particular
places the means $c^{(j)}$ on the same scale for each $j$.) Assuming this is
the case, we shall suppose too that third moments are uniformly bounded. Under
these conditions it is shown by Hall and Wang (2010) that, provided that (a) the
correlations are bounded away from 1, (b) the streams $\mathcal{S}_1, \mathcal{S}_2,
\ldots$ are $k$-dependent for some fixed $k \ge 1$, (c) $z_\alpha$ is bounded between
two constant multiples of $(\log p)^{1/2}$, (d) $\log p = o(n)$, and (e) for
$1 \le j \le p$ we have $0 \le c^{(j)} = c^{(j)}(n) \le \epsilon\, n^{-1/2} (\log p)^{1/2}$,
where $\epsilon \to 0$ as $n \to \infty$; and excepting realisations that arise with
probability no greater than $1 - O\{p \exp(-C z_\alpha^2)\}$, where $C > 0$; the
t-statistics $T^{(j)}_{c^{(j)}}$ can be considered to be independent. In particular, it
can be stated that with probability $1 - O\{p \exp(-C z_\alpha^2)\}$ there are no
clusters of level exceedences caused by dependence among the data streams.

These conditions, especially (d), permit the dimension $p$ to be exponentially large
as a function of $n$. Assumption (e) is of interest; without it the result can fail and
clustering can occur. To appreciate why, consider cases where the data streams are
$k$-dependent but in the degenerate sense that
$\mathcal{S}_{rk+1} = \ldots = \mathcal{S}_{rk+k}$ for $r \ge 0$. Then, for relatively
large values of $c$, the value of $T^{(j)}_c$ is well approximated by that of $c/S_j$,
where $S_j^2 = n^{-1} \sum_{i \le n} (X_{ji} - \bar X_j)^2$ is the empirical variance
computed from the first $n$ data in the stream $\mathcal{S}_j$. It follows that, for
any $r \ge 1$, the values of $T^{(rk+i)}_c$, for $1 \le i \le k$, are also very close
to one another. Clearly this can lead to data clustering that is not described
accurately by asserting independence.

To illustrate these properties we calculated the joint distribution of
$(T_0^{(1)}, \ldots, T_0^{(p)})$ for short-range dependent $p$-vectors
$(X_1, \ldots, X_p)$, and compared this distribution with the product of the
distributions of the $p$ univariate components $T_0^{(k)}$, $k = 1, \ldots, p$.
For $k = 1, \ldots, p$ we took $X_k = (U_k - E U_k)/\sqrt{\mathrm{var}\, U_k}$ and
$U_k = \sum_{j=0}^{10} \theta^j \epsilon_{j+k}$. Here, $0 < \theta < 1$ is a constant
and $\epsilon_1, \ldots, \epsilon_{p+10}$ denote i.i.d. random variables. Figure 1
depicts the resulting distribution functions for several values of $\theta$ and $p$,
when the sample size $n$ was 50 and the $\epsilon_j$s were from a standardised
Pareto(5,5) distribution. We see that the independence assumption gives a good
approximation to the joint cumulative distribution function, but, unsurprisingly, the
approximation degrades as $\theta$ (and thus the dependence) increases. The figure also
suggests that the independence approximation degrades as $p$ becomes very large
($10^4$, in this example).

Figure 1: Comparison of the joint distribution function of
$(T_0^{(1)}, \ldots, T_0^{(p)})$ (denoted by "True cdf") with the product of the
distributions of the univariate components $T_0^{(k)}$, $k = 1, \ldots, p$ (denoted by
"Assume indep"), when $\epsilon_j \sim$ standardised Pareto(5,5), $n = 50$ and, from
left to right, $(p, \theta) = (100, 0.5)$, $(p, \theta) = (100, 0.2)$,
$(p, \theta) = (10000, 0.2)$. The vertical axis gives values of
$P(T_0^{(1)} \le x, \ldots, T_0^{(p)} \le x)$ where $x$ is given on the horizontal axis.
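
The strength and range of the dependence in this construction can be made explicit. The sketch below assumes the moving-sum form $U_k = \sum_{j=0}^{10} \theta^j \epsilon_{j+k}$ with i.i.d. $\epsilon_j$ of finite variance; under that assumption the correlation between $X_k$ and $X_{k+h}$ does not depend on the noise distribution, and vanishes for lags $h > 10$. The function name and the `order` parameter are hypothetical.

```python
def lag_correlation(theta, lag, order=10):
    """Correlation between X_k and X_{k+lag} for the moving sum of given order.

    Assumes U_k = sum_{j=0}^{order} theta^j * eps_{j+k} with i.i.d. eps_j.
    """
    denom = sum(theta ** (2 * j) for j in range(order + 1))
    # Only the noise terms shared by U_k and U_{k+lag} contribute to the covariance.
    numer = sum(theta ** j * theta ** (j + lag) for j in range(order + 1 - lag))
    return numer / denom

rho_strong = lag_correlation(0.5, 1)    # theta = 0.5, adjacent components
rho_weak = lag_correlation(0.2, 1)      # theta = 0.2, adjacent components
rho_far = lag_correlation(0.5, 11)      # beyond the dependence range
```

The adjacent-component correlation is close to $\theta$ itself, which is consistent with the observation that the independence approximation is poorer for $\theta = 0.5$ than for $\theta = 0.2$.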

3 Application to higher criticism for detecting sparse 
signals in non-Gaussian noise 

In this section we develop higher criticism methods where the critical points are 
based on bootstrap approximations to distributions of t statistics, and show that the 
advantages established in section 2 for bootstrap t methods carry over to sparse signal 
detection. 

Assume we observe $X_{1j}, \ldots, X_{nj}$, for $1 \le j \le p$, where all the
observations are independent and where, for each $j$, $X_{1j}, \ldots, X_{nj}$ are
identically distributed. For example, in gene microarray analysis $X_{ij}$ is often
used to represent the log-intensity associated with the $i$th subject and the $j$th
gene, $\mu_j$ represents the mean expression level associated with the $j$th feature
(i.e. gene), and the $Z_{ij}$s represent measurement noise. The distributions of the
$X_{ij}$s are completely unknown, and we allow the distributions to differ among
components. Let $E(X_{ij}) = c^{(j)}$. The problem of signal detection is to test

$$ H_0 : \text{all } c^{(j)}\text{s are zero, against } H_1^{(n)} : \text{a small fraction of the } c^{(j)}\text{s is nonzero.} \tag{3.1} $$

For simplicity, in this section we assume that each $c^{(j)} \ge 0$, but a similar
treatment can be given where nonzero $c^{(j)}$s have different signs.

To perform the signal detection test we use the ideas in section 2 to construct a
bootstrap t higher criticism statistic that can be calculated when the distribution of
the data is unknown, and which is robust against heavy-tailedness of this distribution.
(Higher criticism was originally suggested by Donoho and Jin (2004) in cases where
the centered data have a known distribution, non-Studentised means were used, and
the bootstrap was not employed.) As in section 2.4, let $T^{(j)}$ be the Studentised
statistic for the $j$th component, and let $\hat t^{(j)}_\alpha$ be the bootstrap
estimator of the $1 - \alpha$ quantile of the distribution of $T_0^{(j)}$, both
calculated from the data $X_{1j}, \ldots, X_{nj}$.
We suggest the following bootstrap t higher criticism statistic:

$$ \mathrm{hc}_n(\alpha_0) = \max_{\alpha = i/p,\ 1 \le i \le \alpha_0 p} \{p\, \alpha\, (1 - \alpha)\}^{-1/2} \sum_{j=1}^{p} \big\{ I\big(T^{(j)} > \hat t^{(j)}_\alpha\big) - \alpha \big\} , \tag{3.2} $$

where $\alpha_0 \in (0, 1)$ is small enough for the statistic $\mathrm{hc}_n$ at (3.2)
to depend only on indices $j$ for which $\hat t^{(j)}_\alpha$ is relatively large. This
exploits the excellent performance of bootstrap approximation to the distribution of
the Studentised mean in the tails, as exemplified by Theorems 1 and 2 in section 2,
while avoiding the "body" of the distribution, where the bootstrap approximations are
sometimes less remarkable. We reject $H_0$ if $\mathrm{hc}_n(\alpha_0)$ is too large.
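
A statistic of this type is straightforward to compute once each component has been reduced to a tail probability. The following is a minimal sketch under two labelled assumptions: first, that the event $T^{(j)} > \hat t^{(j)}_\alpha$ is represented by a bootstrap p-value $\pi_j \le \alpha$ (the p-values are taken as given inputs here); second, that the maximum is taken over the grid $\alpha = i/p$ with $\alpha \le \alpha_0$. The function name is hypothetical.

```python
import math

def higher_criticism(pvalues, alpha0):
    """Tail-restricted higher criticism over the grid alpha = i/p, alpha <= alpha0.

    pvalues: one (bootstrap) p-value per feature; small values indicate signal.
    """
    p = len(pvalues)
    ordered = sorted(pvalues)
    i_max = max(1, min(p - 1, int(alpha0 * p)))  # keep alpha strictly below 1
    best = -math.inf
    count = 0  # number of p-values not exceeding the current alpha
    for i in range(1, i_max + 1):
        alpha = i / p
        while count < p and ordered[count] <= alpha:
            count += 1
        # sum_j {I(pi_j <= alpha) - alpha} equals count - p * alpha
        hc = (count - p * alpha) / math.sqrt(p * alpha * (1.0 - alpha))
        best = max(best, hc)
    return best

# Perfectly uniform p-values give a statistic of (numerically) zero ...
null_p = [(j - 0.5) / 1000 for j in range(1, 1001)]
hc_null = higher_criticism(null_p, alpha0=0.1)

# ... whereas 20 very small p-values among 1000 give a large value.
signal_p = [1e-6] * 20 + null_p[:980]
hc_signal = higher_criticism(signal_p, alpha0=0.1)
```

Restricting the grid to $\alpha \le \alpha_0$ means that only the smallest p-values, that is the extreme upper tail of each Studentised statistic, influence the result.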

We could have defined the higher criticism statistic by replacing the bootstrap
quantiles in definition (3.2) by the respective quantiles of the standard normal
distribution. However, the greater accuracy of bootstrap quantiles compared to normal
quantiles, established in section 2, suggests that in the higher criticism context, too,
better performance can be obtained when using bootstrap quantiles. The superiority
of the bootstrap approach will be illustrated numerically in section 4.

Theorem 3 below provides upper and lower bounds for the bootstrap t higher
criticism statistic at (3.2), under $H_0$ and $H_1^{(n)}$. We shall use these results
to prove that the probabilities of type I and type II errors converge to zero as
$n \to \infty$. The standard "test pattern" for assessing higher criticism is a sparse
signal, with the same strength at each location where it is nonzero. It is standard to
take $c^{(j)} = 0$ for all but a fraction $\epsilon_n$ of $j$s, and
$c^{(j)} = \tau_n n^{-1/2}$ elsewhere, where $\tau_n \ne 0$ is chosen to make
the testing problem difficult but solvable. As usual in the higher criticism context we
take

$$ \epsilon_n = p^{-\beta} = n^{-\beta/\theta} , \tag{3.3} $$

where ^ e (0, 1) is a fixed parameter. Among these values of /3 the range < /? < | 
is the least interesting, because there the proportion of nonzero signals is so high 
that it is possible to estimate the signal with reasonable accuracy, rather than just 
determine its existence. See Donoho and Jin (2004). Therefore we focus on the most 
interesting range, which is ^ < /3 < 1. For /3 G (^,1) the most interesting values of Tn 



are r„ x ^2 logp, with r„ < ^2 log p. Taking r„ = o(\/2 logp) would render the two 



hypotheses indistinguishable, whereas taking r„ > \J2 logp would render the signal 
relatively easy to discover, since it would imply that the means that are nonzero are 
of the same size as, or larger than, the largest values of the signal- free T'jy)S. In light 
of this we consider nonzero means of size 

T„ = V2rlogp = V2(r/^)logn , (3.4) 

where < r < 1 is a fixed parameter. 
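
To make the calibration at (3.3)-(3.4) concrete, the following sketch computes the dimension, sparsity level and signal strength for given $n$, $\theta$, $\beta$ and $r$. The helper name `sparse_pattern`, the rounding of $p$ and of the number of nonzero means, and the returned dictionary layout are illustrative assumptions, not part of the paper.

```python
import math

def sparse_pattern(n, theta, beta, r):
    """Sketch of the sparse 'test pattern' parameters in (3.3)-(3.4)."""
    p = round(n ** (1.0 / theta))            # p = n^(1/theta), rounded (assumption)
    eps = p ** (-beta)                       # fraction of nonzero means, (3.3)
    tau = math.sqrt(2.0 * r * math.log(p))   # signal strength, (3.4)
    return {
        "p": p,
        "eps": eps,
        "tau": tau,
        "n_signals": max(1, round(eps * p)),  # rounding rule is an assumption
        "mean_shift": tau / math.sqrt(n),     # nonzero mean tau_n * n^(-1/2)
    }

cfg = sparse_pattern(n=100, theta=0.5, beta=0.75, r=0.3)
```

With these inputs the sketch gives $p = 10^4$ features, of which about ten carry a nonzero mean.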

Before stating the theorem we introduce notation. Let $L_p > 0$ be a generic multi-log
term which may be different from one occurrence to the other, and is such that
for any constant $c > 0$, $L_p \cdot p^{c} \to \infty$ and $L_p \cdot p^{-c} \to 0$ as $p \to \infty$. We also define the
"phase function" by



4 ' 



In the $\beta$-$r$ plane we partition the region $\{1/2 < \beta < 1,\ \rho_\theta(\beta) < r < 1\}$ into three
subregions (i), (ii), and (iii) defined by $r < \tfrac14 (1 - \theta)$, $\tfrac14 (1 - \theta) < r < \tfrac14$, and $\tfrac14 < r < 1$,
respectively. The next theorem, derived in the longer version of this paper (Delaigle
et al., 2010), provides upper and lower bounds for the bootstrap t higher criticism
statistic under $H_0$ and $H_1^{(n)}$, respectively.
Theorem 3. Let $p = n^{1/\theta}$, where $\theta \in (0, 1)$ is fixed, and suppose that, for each
$1 \le j \le p$, the distribution of the respective $X$ satisfies $E(X) = 0$, $E(X^2) = 1$ and
$E|X|^{D_2} < \infty$, where $D_2$ is chosen so large that (2.8) holds with $D_1 > 1/\theta$. Also, take
$\alpha_0 = n p^{-1} \log p$. Then

(a) Under the null hypothesis $H_0$ in (3.1), there is a constant $C > 0$ such that

$$P\{\mathrm{hc}_n(\alpha_0) \le C \log p\} \to 1 \quad \text{as } n \to \infty.$$

(b) Let $\beta \in (1/2, 1)$ and $r \in (0, 1)$ be such that $r > \rho_\theta(\beta)$. Under $H_1^{(n)}$ in (3.1), where
$c^{(j)}$ is modeled as in (3.3)-(3.4), we have

$$P\{\mathrm{hc}_n(\alpha_0) \ge L_p\, p^{\delta(\beta, r, \theta)}\} \to 1 \quad \text{as } n \to \infty,$$

where

$$\delta(\beta, r, \theta) = \begin{cases} \tfrac12 - \beta + (1 - \theta)/2 - (\sqrt{1 - \theta} - \sqrt{r})^2, & \text{if } (\beta, r) \text{ is in region (i)}, \\ r - \beta + \tfrac12, & \text{if } (\beta, r) \text{ is in region (ii)}, \\ 1 - \beta - (1 - \sqrt{r})^2, & \text{if } (\beta, r) \text{ is in region (iii)}. \end{cases}$$

It follows from the theorem that, if we set the test so as to reject the null hypothesis
if and only if $\mathrm{hc}_n > a_n$, where $a_n / \log p \to \infty$ as $n \to \infty$, and $a_n = O(p^{d})$ where
$d < \delta(\beta, r, \theta)$, then as long as $r > \rho_\theta(\beta)$, the probabilities of type I and type II errors
tend to zero as $n \to \infty$ (note that $\delta(\beta, r, \theta) > 0$).
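
The exponent $\delta(\beta, r, \theta)$ of Theorem 3 is easy to evaluate numerically. The sketch below does so region by region and also computes the curve on which $\delta$ vanishes. Identifying that curve with the frontier $r = \rho_\theta(\beta)$ is an assumption made here for illustration; the function names `delta` and `zero_delta_curve` are hypothetical.

```python
import math

def delta(beta, r, theta):
    """Exponent delta(beta, r, theta) from Theorem 3(b), region by region."""
    if r < (1.0 - theta) / 4.0:   # region (i)
        return 0.5 - beta + (1.0 - theta) / 2.0 - (math.sqrt(1.0 - theta) - math.sqrt(r)) ** 2
    if r < 0.25:                  # region (ii)
        return r - beta + 0.5
    return 1.0 - beta - (1.0 - math.sqrt(r)) ** 2   # region (iii)

def zero_delta_curve(beta, theta):
    """Value of r at which delta(beta, r, theta) = 0, for 1/2 < beta < 1.

    Obtained by solving delta = 0 within each region; treating this as the
    detection frontier is an assumption of this sketch.
    """
    if beta <= (3.0 - theta) / 4.0:
        return (math.sqrt(1.0 - theta) - math.sqrt(1.0 - theta / 2.0 - beta)) ** 2
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2
```

For $\theta = 0$ the curve reduces to $(1 - \sqrt{1 - \beta})^2$ for every $\beta$, and for $\beta > 3/4$ it does not depend on $\theta$.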

It is also of interest to see what happens when $r < \rho_\theta(\beta)$, and below we treat
separately the cases $r < \rho(\beta)$ and $\rho(\beta) < r < \rho_\theta(\beta)$, where $\rho(\beta) = \rho_1(\beta) \le \rho_\theta(\beta)$ is the
standard phase function discussed by Donoho and Jin (2004). We start with the case
$r < \rho(\beta)$. There, Ingster (1999) and Donoho and Jin (2004) proved that for the sizes
of $\epsilon_n$ and $\tau_n$ that we consider in (3.3)-(3.4), even when the underlying distribution of
the noise is known to be the standard normal, the sum of the probabilities of type I
and type II errors of any test tends to 1 as $n \to \infty$. See also Ingster (2001). Since our
testing problem is more difficult than this (in our case the underlying distribution of
the noise is estimated from data), in this context too, asymptotically, any test fails if
$r < \rho(\beta)$.

It remains to consider the case $\rho(\beta) < r < \rho_\theta(\beta)$. In the Gaussian model, i.e.
when the underlying distribution of the noise is known to be standard normal, it was
proved by Donoho and Jin (2004) that there is a higher criticism test for which the
sum of the probabilities of type I and type II errors tends to 0 as $n \to \infty$. However,
our study does not permit us to conclude that bootstrap t higher criticism will yield
a successful test. The reasons for the possible failure of higher criticism are twofold:
the sample size, $n$, is relatively small, and we do not have full knowledge of the
underlying distribution of the background noise. See Figure 2 for a comparison of the
two curves $r = \rho_\theta(\beta)$ and $r = \rho(\beta)$.

Figure 2: Left: $r = \rho(\beta)$ (black) and $r = \rho_\theta(\beta)$ with $\theta = 0.25$ (blue), 0.5 (green),
and 0.75 (red). For each $\theta$, in the region sandwiched between the two curves $r = \rho(\beta)$ and
$r = \rho_\theta(\beta)$, higher criticism is successful in the Gaussian case, but possibly not
in the non-Gaussian case. Right: magnification of lower left portion of graph. The
horizontal and vertical axes depict $\beta$ and $r$, respectively.

The case where $p$ is exponentially large (i.e. $n = (\log p)^{a}$ for some constant $a > 0$)
can be interpreted as the case $\theta = 0$, where $\rho_\theta(\beta)$ reduces to $(1 - \sqrt{1 - \beta})^2$. In this
case, if $r > (1 - \sqrt{1 - \beta})^2$ then the sum of probabilities of type I and type II errors
of $\mathrm{hc}_n$ tends to 0 as $n$ tends to $\infty$. The proof is similar to that of Theorem 3 so we
omit it.



4 Numerical properties 

First we give numerical illustrations of the results in section 2.1. In Figure 3 we
compare the right tail of the cumulative distribution functions of $Z_0$ and $T_0$ with
the right tail of $\Phi$, denoting the standard normal distribution function, when $U$
has increasingly heavy tails. We take $X = (U - EU)/(\mathrm{var}\, U)^{1/2}$ where $U = N|N|$
Figure 3: Distribution function ($F$), top, and inverse distribution function ($F^{-1}$),
bottom, of $T_0$ (F stud), $Z_0$ (F stand) and of a N(0,1) when, from
left to right, $U = N|N|$ with $n = 50$, $U = N|N|$ with $n = 100$, U = N^|N| with $n = 50$, U = N^|N|
with $n = 100$, and where $N \sim N(0, 1)$.
Figure 4: Inverse ($F^{-1}$) of the distribution function of $T_0$ ( ), of the standard

normal variable ( — ), and 200 bootstrap estimators of the distribution function of 
To (red curves), when X is a standardised F(5,5), n = 50 (left), n = 100 (middle), 
n = 250 (right). 

(moderate tails) or N^|N| (heavier tails), with $N \sim N(0, 1)$. The figure shows that $\Phi$
approximates the distribution of $T_0$ better than it approximates that of $Z_0$, and that
the normal approximation to the distribution of $Z_0$ degrades as the distribution of
$X$ becomes more heavy-tailed. The figure also compares the right tail of the inverse
cumulative distribution functions, which shows that the normal approximation is
more accurate in the tails for $T_0$ than for $Z_0$. Unsurprisingly, as the sample size
increases the normal approximation for both $T_0$ and $Z_0$ becomes more accurate.
Next we illustrate the results in section 2.2. There we showed that although $T_0$
is more robust than $Z_0$ against heavy-tailedness of the distribution $F_X$ of $X$, the
distribution of $T_0$ is somewhat more affected by the skewness of $F_X$. To illustrate the
success of the bootstrap in correcting this problem we compare the bootstrap and normal
approximations for several skewed and heavy-tailed distributions. In particular,
Figure 4 shows results obtained when $X = (U - EU)/(\mathrm{var}\, U)^{1/2}$, with $U \sim F(5,5)$.
Since, later in this section, we shall be more interested in approximating quantiles of
the distribution of $T_0$ than the distribution itself, in Figure 4 we show
the right tail of the inverse cumulative distribution function of $T_0$ and 200 bootstrap
estimators of this tail obtained from 200 samples of sizes $n = 50$, $n = 100$ or $n = 250$
simulated from $F_X$. We also show the inverse cumulative distribution function of
the standard normal distribution. The figure demonstrates clearly that the bootstrap
approximation to the tail is more accurate than the normal approximation, and
that the approximation improves as the sample size increases. We experimented with
other skewed and heavy-tailed distributions, such as other F distributions and several
Pareto distributions, and reached similar conclusions.

Note that, when implementing the bootstrap, the number $B$ of bootstrap samples
has to be taken sufficiently large to obtain reasonably accurate estimators of the
tails of the distribution. In general, the larger $B$, the more accurate the bootstrap
approximation, but in practice we are limited by the capacity of the computer. To
obtain a reasonable approximation of the tail up to the quantile $t_\alpha$, where $\alpha < 1/2$, we
found that one should take $B$ no less than $100/\alpha$.
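
The rule of thumb $B \ge 100/\alpha$ can be built into a bootstrap quantile routine, as in the following sketch. The names `min_bootstrap_replicates` and `bootstrap_upper_quantile` are hypothetical, the variance divisor $n$ is an assumption, and the empirical order statistic is only one of several reasonable definitions of a sample quantile.

```python
import math
import random

def min_bootstrap_replicates(alpha):
    """Smallest integer B with B >= 100 / alpha (the rule of thumb above)."""
    return math.ceil(100.0 / alpha)

def bootstrap_upper_quantile(sample, alpha, n_boot=None, seed=0):
    """Sketch: bootstrap estimate of the (1 - alpha) quantile of the t-ratio."""
    rng = random.Random(seed)
    n = len(sample)
    if n_boot is None:
        n_boot = min_bootstrap_replicates(alpha)
    mean = sum(sample) / n
    t_star = []
    for _ in range(n_boot):
        res = [sample[rng.randrange(n)] for _ in range(n)]
        m = sum(res) / n
        sd = math.sqrt(sum((v - m) ** 2 for v in res) / n)  # divisor n (assumption)
        # Bootstrap t-ratio, centred at the mean of the original sample.
        t_star.append(math.sqrt(n) * (m - mean) / sd)
    t_star.sort()
    k = math.ceil((1.0 - alpha) * n_boot)  # rank of the empirical (1 - alpha) quantile
    return t_star[k - 1]
```

For $\alpha = 1/p$ with $p$ in the thousands this rule asks for several hundred thousand resamples per feature, which is where the computing limits mentioned above begin to matter.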

Let hc and $\mathrm{hc}_{\mathrm{norm}}$ denote, respectively, the theoretical and the normal versions of
the higher criticism statistic, defined by the formula at the right-hand side of (3.2),
replacing there the bootstrap quantiles $\hat t_\alpha^{(j)}$ by $t_\alpha^{(j)}$ and $z_\alpha$, respectively, where $t_\alpha^{(j)}$
denotes the $1 - \alpha$ theoretical quantile of $T_0^{(j)}$ and $z_\alpha$ denotes the $1 - \alpha$ quantile of the
standard normal distribution. To illustrate the success of the bootstrap in applications of
the higher criticism statistic, in our simulations we compared the statistic hc, which we
could use if we knew the distribution $F_X$, the bootstrap statistic $\mathrm{hc}_n$ defined at (3.2),
where the unknown quantiles $t_\alpha^{(j)}$ are estimated by the bootstrap quantities $\hat t_\alpha^{(j)}$ as
discussed in the previous paragraph, and the normal version $\mathrm{hc}_{\mathrm{norm}}$. We constructed
histograms of these three versions of the higher criticism statistic, obtained from
1000 simulated values calculated under $H_0$ or an alternative hypothesis. For any of
the three versions, to obtain the 1000 values we generated 1000 samples of size $n$,
of $p$-vectors $(X_1, \ldots, X_p)$. We did this under $H_0$, where the mean of each $X_j$ was
zero, and under various alternatives $H_1^{(n)}$ where we set a fraction $\epsilon_n$ of these means
equal to $\tau_n n^{-1/2}$, with $\tau_n > 0$. As in section 3 we took $p = n^{1/\theta}$, $\epsilon_n = n^{-\beta/\theta}$ and
$\tau_n = \sqrt{2 r \log p}$, where we chose $\beta$ and $r$ to be on the frontier of the region $r > \rho_\theta(\beta)$.
Figure 5 shows the histograms under Hq and under various alternatives h[^^ 
located on the frontier (r = pe{f3), for (3 = ^, (3 = ^ + ^ {1 — 9) , (3 = ^ and (3 = 1), 
when the X/s are standardised F(5, 5) variables, n = 100 and = \. We can see that 
the histogram approximations to the density of the bootstrap $\mathrm{hc}_n$ are relatively close
to the histogram approximations to the density of hc. By contrast, the histograms in
the case of $\mathrm{hc}_{\mathrm{norm}}$ show that the distribution of $\mathrm{hc}_{\mathrm{norm}}$ is a poor approximation to the
distribution of hc, reflecting the inaccuracy of normal quantiles as approximations to
the quantiles of heavy-tailed, skewed distributions. We also see that, except when
$\beta = 1$, the histograms for hc and $\mathrm{hc}_n$ under $H_0$ are rather well separated from those
under $H_1^{(n)}$. This illustrates the potential success of higher criticism for distinguishing
between $H_0$ and $H_1^{(n)}$. By contrast, this property is much less true for $\mathrm{hc}_{\mathrm{norm}}$.

We also compared histograms for other heavy-tailed and skewed distributions, such
as the Pareto, and reached similar conclusions. Furthermore, we considered skewed
but less heavy-tailed distributions, such as the chi-squared(10) distribution. There
too we obtained similar results, but, while the bootstrap remained the best approximation,
the normal approximation performed better than in heavy-tailed cases. We
also considered values of $(\beta, r)$ further away from the frontier, and, unsurprisingly
since the detection problem became easier, the histograms under $H_1^{(n)}$ became even
more separated from those under $H_0$.

5 Technical arguments 

5.1 Preliminaries 

Let Tc be as in (2.11). Then the following result can be proved using arguments of 
Wang and Hall (2009). 
Figure 5: Histograms of hc statistics under $H_0$ (rows 1,3,5) or under $H_1^{(n)}$ (rows
2,4,6), when the Xj^s are standardised F(5,5) variables, n = 100, 9 = ^, p = n^^^, 
e„ = n~^^^ and r„ = y/2r logp, where r = pg{(3). In each row, from left to right, 
/3 = ^, /3 = ^ + I (1 — 0), /3 = I and (3 = 1. Rows 1 and 2 are for the theoretical he; 
rows 3 and 4 for $\mathrm{hc}_n$; and rows 5 and 6 for $\mathrm{hc}_{\mathrm{norm}}$.
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(5.1) 



Theorem A. Let B > 1 denote a constant. Then, 

= cxp { - I {2x' -3cx' + c') 7} 
1 - $(a; - c) ^ ^ 

X 1 + 0{c, n, x) |(1 + \x\) n-^l'^ + (1 + kD^n"^} 

as n — )■ 00, where the function 6 is bounded in absolute value by a finite, positive 
constant Ci (B) ( depending only on B ), uniformly in all distributions of X for which 
E\X\'^ < B , E{X^) = 1 and E{X) = 0, and uniformly in c and x satisfying < x < 
Bn^/^ andO<c< ux, where < u < 1. 

We shall employ Theorem A to prove the theorem below. Details are given in a
longer version of this paper (Delaigle et al., 2010). Take $\mathcal{F}$ to be any subset of the
class of distributions $F$ of the random variable $X$, such that $E(|X|^{6+\epsilon}) \le B$ for some
$\epsilon > 0$ and a constant $1 < B < \infty$, $E(X) = 0$ and $E(X^2) = 1$. Recall the definition
of $T_0^*$ in (2.6), let $t = t_\alpha$ and $t = \hat t_\alpha$ denote the respective solutions of $P(T_0 > t) = \alpha$
and $P(T_0^* > t \mid \mathcal{X}) = \alpha$, and recall that $z_\alpha = (1 - \Phi)^{-1}(\alpha)$. Take $\eta \in (0, \epsilon/\{4(6 + \epsilon)\})$,
and let $T_c$ and $\hat t_\alpha$ denote independent random variables with the specified marginal
distributions.

Theorem B. Let B > 1 denote a constant. Then, 

Pf{T, > t^) = Pf{T, > t^) exp | c (3 _ ^2^ ^| 

X [l + 0{(l + ^,)n-i/2 + (i + ^j4^-i}] 



+ 



k=i ^ ^ i=\ 



> n' 



-(l/4)-r, 



(5.2) 



as $n \to \infty$, uniformly in all $F \in \mathcal{F}$ and in all $c$ and $\alpha$ satisfying $0 \le z_\alpha \le B n^{1/4}$
and $0 \le c \le u z_\alpha$, where $0 < u < 1$.



5.2 Proof of Theorem 2 

The following theorem can be derived from results of Adamczak (2008). 

Theorem C. If $Y_1, \ldots, Y_n$ are independent and identically distributed random variables
with zero mean, unit variance and satisfying

$$P(|Y| \ge y) \le K_1 \exp(-K_2 y^{\xi}) \qquad (5.3)$$

for all $y > 0$, where $K_1, K_2, \xi > 0$, then for each $\lambda > 1$ there exist constants $K_3, K_4 > 0$,
depending only on $K_1$, $K_2$, $\xi$ and $\lambda$, such that for all $y > 0$,

$$P\Big( \Big| \sum_{i=1}^{n} Y_i \Big| \ge y \Big) \le 2 \exp\Big( - \frac{y^2}{2 \lambda n} \Big) + K_3 \exp\Big( - \frac{y^{\xi}}{K_4} \Big).$$


We use Theorem C to bound the remainder terms in Theorem B. If $P_F(|X| > x) \le
C_1 \exp(-C_2 x^{\xi_1})$ and we take $Y = (1 - E) X^k$ for an integer $k$, then (5.3) holds for
constants $K_1$ and $K_2$ depending on $C_1$, $C_2$ and $\xi_1$, and with $\xi = \xi_1/k$. In particular,
for all $x > 0$,


Pi.| ^(l-E)Xf > a;(varX^y/'| < 2 exp ^ 



X 



2Xn 



+ K3 exp - 



X' 



Taking = 1, 2 or 3, and x = Xkn = const, n*^^/^^ ''^ for some 771 > 0; or A; = 4 and 
X = Xkn = const.; we deduce that in each of these settings, 



n 



i=l 



0{ cxp ( - n(^^i/^'=)-''2) I if = 1, 2, 3 
0{exp(-n«i/7i^5)} if A; = 4, 



where r]2 > decreases to zero as r^i | 0. Therefore the 0[. . .] remainder term in (5.2) 
equals 0{ex'p(—n^^^^^^^^~^'^)} , and so Theorem 2 is implied by Theorem B. 

5.3 Proof of (2.12) 

Note that, by Theorem B in section 5.1, Theorems 1 and 2 continue to hold if we
replace the left-hand sides of (2.8) and (2.10) by $P_F(T_c > \hat t_\alpha)$, provided we also replace
the factor $\alpha$ on the right-hand sides by $P_F(T_c > t_\alpha)$. The uniformity with which (2.8)
and (2.10) hold now extends (in view of Theorem B) to $c$ such that $0 \le c \le u z_\alpha$ with
$0 < u < 1$, as well as to $\alpha$ satisfying $0 \le z_\alpha \le B n^{1/4}$.
A PAGES 27-37: NOT-FOR-PUBLICATION APPENDIX

A.1 Proof of Theorem B

Step 1: Expansions of $\hat t_\alpha$ and $t_\alpha$. The main results here are (A.3) and (A.5). To
derive them, take $W^*$ to have the distribution of $(X^* - \bar X)/S$, where $S$ is as in (2.1),
and, for $k = 3$ and 4, put

$$\hat\gamma_k = E(W^{*k} \mid \mathcal{X}) = \frac{1}{n S^k} \sum_{i=1}^{n} (X_i - \bar X)^k,$$

where $W^{*k} = (W^*)^k$. Letting $c = 0$ in Theorem A, and taking $X$ there to have the
distribution of $W^*$ conditional on $\mathcal{X}$, we deduce that if $B > 1$ is given,



— ' " ' ^ = exp ( - n ^''\x^-i) 



X 



1 + ei(n, x) {(1 + \x\) + (1 + 1^1)4 (A.l) 



where 7 = 73 and: 



the random function ©i(n, x) satisfies |0i(n, a;)| < Ci{B) (where Ci{B) 
is the same constant introduced in Theorem A) uniformly in datasets 
X for which S > \ and 74 < and uniformly also in x satisfying 
Q<x<B n^l^. 



(A.2) 



Properties (A.l) and (A.2) imply that^a satisfies: 

ta^Zo, [l-|7^"'/'-2a + ©2(n,a){(l + ^„)"'ri-^/2 + (l + ^«)'ri-^}] , (A.3) 

where 2; = 2;^ is the solution of 1 — ^{za) = ot and, in the case j = 2: 

the random function Qj{n,a) satisfies \Qj{n,a)\ < Cj{B) (where Cj{B) 

is a finite, positive constant) uniformly in datasets X for which S > ^ (^-4) 

and 74 < and uniformly also in a satisfying | < l-a < 1-^Bn^/^) 

Analogously, Theorem A imphes that ta satisfies: 



trv — 



l-l^n-^/^Za + 9{n,a){{l + Za)-'n-'/^ + {l + Zafn-'}'^ , (A.5) 



where z = Za is the solution of 1 — ^(za) = a and: 
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the function 9{n,a) satisfies |^(n, q;)| < C2{B) (with C2{B) denoting 
a finite, positive constant) uniformly in distributions of X for which 
E{X) = 0, E{X^) = 1 and E{X^) < B, and uniformly also in a satisfy- 
ing < ^„ < ^ 71^4. 

The derivations of the pairs of properties (A. 3) and (A. 4), and (A. 5) and (A. 6), 
are similar. For example, suppose that if is given by (A. 3) rather than by P(Tq > 
tal'^) = C(, and that the function 02 in (A. 3) is open to choice except that it should 
satisfy (A.4). If we define p{z) ^ z {1 - ^{z)}/(j){z) = 1 - z'"^ + 3z~'^ - then by 
(A.l), (A.3) and (A.4), 

Pf{T* >ta\X) = {l- $(?„)} exp ( - i ^^^^ 

X [l + QiinJa) {(1 + \ta\) n~'^^ + (1 + 1^.1)'^"'}] 
= (27r)-V2 exp { - I zl (l - f 7 n-V^ - n'^^ i ^ ^} 

X Z-' p{zo) [1 + 63(71, z^) {(1 + z^) + (1 + z^f n-i}] 

= {1 - $(^„)} [1 + 83(71, Z^) {(1 + Z^) 71- V2 + (1 + 71-^}] , 

(A.7) 

where ©3 satisfies (A.4). By judicious choice of ©2, satisfying (A.4), we can ensure 
that ©3 in (A.7) vanishes, up to the level of discreteness of the conditional distribution 
function of Tq . In this case the right-hand side of (A.7) equals simply 1 — (^{za) = a, 
so that indeed has the intended property, i.e. -P(Tq > ta \ X) = a. 
Step 2: Expansions of the difference between $\hat t_\alpha$ and $t_\alpha$. The main results here are
(A.10) and (A.11). To obtain them, first combine (A.3) and (A.5) to deduce that:

ta-ta = \zl{^- 7) n~^'^ + 04(n, a) {n'^'^ + {I + z^f n-^] , (A.8) 

where, for j = 4: 

the random function Qj{n,a) satisfies |©j(7z, q;)| < Cj{B) (with Cj{B) 
denoting a finite, positive constant) uniformly in datasets X for which 
S > ^ and 74 < -B; uniformly in distributions of X for which E{X) = 0, 

(A.9) 

E(X'^) = 1 and E{X'^) < B; and uniformly also in a satisfying < Za < 
Bn^l\ 

Using (A.5), (A.6), (A.8) and (A.9) we deduce that: 

tl (ta -ta)^l 4 n-'^^ (7 - 7) + Qnin, a) {{1 + z^f n^''' + (1 + z^f 71"^ } , 

(ta - taf = \ Zl 71-1 _ ^ «){(! + Zaf 71-1 ^ (1 ^ ^^)6 ^-3/2| 
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and (ta - taf = 67(71, a) (1 + z^f n'^/^ < Qsin, a) (1 + Za)"^ n'^, where ©5, . . . , ©9 
(the latter appearing below) satisfy (A. 9). Therefore, 



— "^^a C'^a ^a) + 3^0, (t^ to) + (t^ to) 

= 4 (7 - 7) + ©9(n, «){(! + z,,)^ + (i + ^-i}. (A.IO) 
Similarly, using (A. 5) and (A. 8), 

(ta - cf - {ta -cf = 2 (t„ - C) (f, - t^) + (f„ - 

-i(^a-c);^^(7-7)n-^/' 

+ ©io(c, n, «){(! + + (i + Zo)^n-^] , (A.ll) 

where, for j > 10: 

the random function Qj{c,n,a) satisfies \Qj{c,n,a)\ < Cj{B) (with 
Cj{B) denoting a finite, positive constant) uniformly in datasets X for 
which S > ^ and < B, uniformly in distributions of X for which 
E{X) = 0, E{X'^) = 1 and E{X^) < B, and uniformly also in c such 
that < c < u Za where < u < 1, and in a such that < Za < B n^/^. 



(A.12) 



Step 3: Initial expansion of $P(T_c > \hat t_\alpha)$. To derive (A.16), the main result in this
step, note that by (A.3)-(A.5) and (A.11),

1 - ^(ta - C) = {Za - C)-^ p{Za - c) {2n)-'/^ exp { - | (ta - cf} 

X 1 + ©ii(c, n, a){(l + Za) n-^/^ + (1 + Za)^ n"^}] 

= {Za - C)-' p{Za - C) (27r)-^/' 

X exp I - I {ta -cf-l {za - c) zl (7 - 7) n"^/'} 

X [1 + ©i2(c, n, a) {(1 + Za) n'^/^ + (1 + ^„)^ n"^}] 

= {1 - ^ta - c)} exp { - I {za - c) zl (7 - 7) n-^'^] 

X [1 + ©i3(c, n, a) {(1 + Zc,) + (1 + ZaY n"^}] . (A.13) 

If Tc is statistically independent of ^q, then, by (5.1), (A.IO) and (A.ll) (the latter 
with c = 0), 

Pf{Tc > tg I tg) 
1 - ^{tg - C) 
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exp{-n-V2i(2t^-3ci^ + c=^)7} 

X 1 + ei4(c, n, a) I (1 + Za) rT^I'^ + (1 + z^f n'^ | j 



X exp 



n 



-1 1 



7(7-7){4-c4} 



X 



1 + Oisfc 



n, a) 1 1 



1 + Za) + (1 + 2;, 



l + ei6(c,n,Q;)|(l + ^«)n-'/' + (l + -Za)^n-^}] ■ (A.14) 



1 - - c 
Combining (A. 13) and (A.14) we deduce that: 

Pp{T, > ta I fa) = Ppin > ta) " exp { - | (z^ - c) zl (7 - 7) U-^'^] 

X [1 + 817(0, n, a) {(1 + Za) + (1 + z^)^ n-^] 



(A15) 



Reflecting (A. 12), let Qi{B) denote the class of distribution functions F of X such 
that E{X) = 0, E{X'^) = 1 and E{X^) < B; write Pp for probability measure when 
X is drawn from the population with distribution function F E Qi; let V denote any 
given event, shortly to be defined concisely; let S{B) be the intersection of V and 
the events S > ^ and 74 < B; and write S{B) for the complement of S{B). In view 
of (A. 15), 



Pf{T, > ta) = Pf{T, >t^)-E 



exp <n ^ {za 



c] z. 



X 



1 + 0{(1 + ^„) + (1 + ^-1}] 



+ 0[P^{^(5)}], 
uniformly in the following sense: 



(A.16) 



uniformly in F e ^i(-B), in c such that < c < wZq, where < < 1, 

and in a such that < z« < ^ n^'^- (A- 17) 

Step 4: Simplification of right-hand side of (A.16). Here we derive a simple formula,
(A.28), for the expectation on the right-hand side of (A.16). That result, when
combined with (A.16) and (A.17), leads quickly to Theorem B.

Put Afe = n-^ - EX''), write Vk to denote the event that |Afe| < 

Cgn-^^/^)-" where C3 > and 77 e (0, |), and put X> = X>i n n V3. Observe 
that 



7 = (7 + A3 - 3 Ai A2 - 3Ai + 2 A?) / (1 + A2 - Al) , 



(A.18) 
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Prom this property it can be proved that if I7I < B and C3 is sufficiently small, 
depending only on B, then I7 — 7] < whenever V holds. Therefore if V 

holds, and < c < u Za ior < u < 1, and < Za < B n^/^, then 

n~^^^ \za - c\ zl I7 - 7I < B^ n-" . 



In these circumstances, defining A = n | {z^ — c) z"^ (7 — 7), we have: 



3 

r 



I{V) < n-^^y^ exp {B^ n"'') . (A. 19) 



Note too that ii E{X^) < B then 

£'(A^^^ A^^) < Ci{B) rT^ whenever k-i and k2 take values in the set 

{1, 2, 3}, ri and r2 are nonnegative, and ri + r2 = 1 or 2. (A. 20) 

Also, in the same context as (A. 20), if ri + r2 = 2 then 

< E{\Al\ A-l) < {^(A^-) E{Al^)Y" < C,{B)n-' ; (A.21) 

and if ri = 1 and r2 = 0, and r] is sufficiently small, 

E{\Al\ A-l I{V)] < [EAl^ P{V)Y'' < C,{B,ri) {n'^ n-^^'^^-^Y" 

= C5(5,r;)n-(^/")-(^/2) (A.22) 

where C = C(^) > 0. In deriving (A.22) we used the fact that P{V) < P(Pi) + 
P{V2) + P{V2), and that, by Markov's inequality (employing the fact that E\X\^"^'' 
< 00 and choosing rj < e/{4(3 + e)}), 

P{Vk) < (C3n-(i/^)-^)-^'+^'/'^^£;(|Afe|2+(^/3)) 

< Ce{B,r]) ^(V2)+2^+(e/12)+(W3)-{2+{./3)}/2 < C,{B , 7]) 71-^'^^^''^ 

for A; = 1, 2, 3, where C > 0. Therefore, 

P{V) < 3 Ce{B, rj) n-(^/2)-c _ (a.23) 
If ri + r2 + rs > 3 then an argument similar to that leading to (A. 23) shows that 

E{\AY A^/ A^^l 7(P)} < C,iB,r)) (n-V^)^ ^^-m)-,yi+r,+rs-2 ^^ ^4) 

Combining (A. 20), (A.21), (A.22) and (A. 24); using Taylor expansion to derive ap- 
proximations to 7 — 7, starting from (A. 18); noting the definition of A given in the 
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previous paragraph; and observing that n \za — c\ z"^ < B^n}/^ Q < c < uZo 
with <u < 1, and Q < Za < B n^/^; we deduce that: 



\E{l^n{V)]\ 



< 



Cs{B,j)n^/^n-^ if j = 1 

Cs{B, j) {n^/^Y (n-^^/"^-'')'"' if i > 2 
<Cs{B,j)n-'/'. (A.25) 



Using (A. 19), (A. 23) and (A.25), and choosing r to be the least integer such that 
{r + l)r) > |, we deduce that: 



E 



exp I {z^ - c) zl (7-7)} m] = 1 + 0{n-'/') , (A.26) 



uniformly in the following sense: 

uniformly in F G Q2{B), in c such that < c < uZa with < ti < 1, 
and in a such that 0<Za< Bn^/^, (A.27) 
where Q2{B) denotes the intersection of Qi{B) (defined at (A. 17)) with the class 
of distributions of X such that F(|X|6+') < 5. 

An argument almost identical to that leading to (A.26) and (A.27) shows that the 
same pair of results holds if we replace V by the event Vi that S > \ and j4 < B. The 
only change needed is the observation that, since F El Q\ (B) entails E(\X\^+') < B, 
P{T>i) is uniformly bounded above by a constant multiple of This follows 

from the fact that, \i Yi,Y2, . . . are random variables satisfying < 00, then 

P{\ {l-E) F/l >n} < const. n-^^''^^-^^/^\ Therefore, in the argument in (A.22) 
we can replace the bound const, n"'-^/^-'"'' to P{T>) by the bound const. n~^^/'^^^^^f^^ 
to P{T>i). This means that (A.26) holds if we replace T) there by the event V fl Pi, 
i.e. the event S{B) introduced just above (A. 16). That is. 



E 



exp {n-V2 1 (;,„ _ c) zl (7-7)} li^m] = 1 + 0{n-'/') , (A.28) 



uniformly in the sense of (A.27). 

Together, (A. 16), (A. 17) and (A.28) imply that (5.2) holds uniformly in F G J-", 
in c such that < c < u Za, with < m < 1 and in a such that < < Bn^^'^, 
completing the proof of Theorem B. 
A.2 Proof of Theorem 3

Throughout this proof we use the notation $\mathrm{hc}_n^* = \mathrm{hc}_n(\alpha_0)$, where $\alpha_0 = n p^{-1} \log p$
denotes the value of $\alpha_0$ stated in the theorem. Also, for two positive sequences $a_n$
and $b_n$, we write $a_n \lesssim b_n$ when $\limsup_{n \to \infty} (a_n/b_n) \le 1$. We use the equivalent
notation $b_n \gtrsim a_n$.

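Fix $\alpha \in (0,1)$. Let $G_p(\alpha) = p^{-1} \sum_{j=1}^{p} I\{T^{(j)} > \hat t_\alpha^{(j)}\}$, and $\mathrm{hc}_{n,\alpha} = \sqrt{p}\, \{\alpha(1 - \alpha)\}^{-1/2} \{G_p(\alpha) - \alpha\}$.
We have $\mathrm{hc}_n^* = \max_{\alpha = i/p,\, 1 \le i \le \alpha_0 p} \mathrm{hc}_{n,\alpha}$. We introduce a non-stochastic counterpart

$$\widetilde{\mathrm{hc}}_n = \max_{\alpha = i/p,\; 1 \le i \le \alpha_0 p} \widetilde{\mathrm{hc}}_{n,\alpha}$$

of $\mathrm{hc}_n^*$, where $\widetilde{\mathrm{hc}}_{n,\alpha} = \sqrt{p}\, \{\alpha(1 - \alpha)\}^{-1/2} \{\bar G_p(\alpha) - \alpha\}$ and $\bar G_p(\alpha) = p^{-1} \sum_{j=1}^{p} P\{T^{(j)} > \hat t_\alpha^{(j)}\}$.
Note that $\bar G_p(\alpha) = E\{G_p(\alpha)\}$.
The keys for the proofs are:

(A) There is a constant $C > 0$ such that

$$\lim_{n \to \infty} P\{ |\mathrm{hc}_n^* - \widetilde{\mathrm{hc}}_n| \le C \log p \} = 1, \quad \text{under } H_0,$$
$$\lim_{n \to \infty} P\{ |\mathrm{hc}_n^* - \widetilde{\mathrm{hc}}_n| \le C \log p \sqrt{1 + \widetilde{\mathrm{hc}}_n} \} = 1, \quad \text{under } H_1^{(n)}.$$

(B) Under $H_0$, there is a constant $C > 0$ such that $\widetilde{\mathrm{hc}}_n \le C \log p$ for sufficiently large
$n$.

(C) Under $H_1^{(n)}$, $\widetilde{\mathrm{hc}}_n = L_p\, p^{\delta(\beta, r, \theta)}$.

Combining (A)-(B), there exist constants $C_1 > 0$ and $C_2 > 0$ such that $\widetilde{\mathrm{hc}}_n \le C_1 \log p$
and $P\{|\mathrm{hc}_n^* - \widetilde{\mathrm{hc}}_n| \le C_2 \log p\} = 1 + o(1)$. Therefore,

$$P\{\mathrm{hc}_n^* \le (C_1 + C_2) \log p\} \ge P\{|\mathrm{hc}_n^* - \widetilde{\mathrm{hc}}_n| \le C_2 \log p\} = 1 + o(1),$$

and part (a) of Theorem 3 follows. Combining (A) and (C) gives that

$$P\{\mathrm{hc}_n^* \ge L_p\, p^{\delta(\beta, r, \theta)}\} \ge P\{\mathrm{hc}_n^* \ge \widetilde{\mathrm{hc}}_n - C \log p \sqrt{1 + \widetilde{\mathrm{hc}}_n}\} \to 1 \quad \text{as } n \to \infty,$$

and part (b) of the theorem follows. Note that $C$ and $L_p$ may stand for different
quantities in different occurrences.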

We now show (A)-(C). Below, whenever we refer to $\alpha$, we assume that $1/p \le
\alpha \le \alpha_0$. By definition, $\bar G_p(\alpha) = p^{-1} \sum_{j=1}^{p} P\{T^{(j)} > \hat t_\alpha^{(j)}\}$, where the fraction of $c^{(j)} = 0$
is 1 under the null and $(1 - \epsilon_n)$ under the alternative. Using Theorem 1 and noting
that $O(n^{-D_1}) = o(1/p)$ and that $z_\alpha \le O(\sqrt{\log p})$ in (2.8), we have

$$P\{T^{(j)} > \hat t_\alpha^{(j)}\} = \alpha \{1 + O(\sqrt{\log p}/\sqrt{n})\} + o(1/p), \quad \text{when } c^{(j)} = 0. \qquad (A.29)$$

It follows that both under the null and under the alternative,

$$(1 - \epsilon_n)\, \alpha \lesssim \bar G_p(\alpha) \lesssim (1 - \epsilon_n)\, \alpha + \epsilon_n. \qquad (A.30)$$

As a result, uniformly in $\alpha \in [1/p, \alpha_0]$,

$$\alpha = o(1), \quad \bar G_p(\alpha) = o(1), \quad p\, \bar G_p(\alpha) \gtrsim p \alpha \ge 1. \qquad (A.31)$$

Consider (A). Note that for any integer TV > 1 and any positive sequences and 
6j, maxi<j<7v{ai&i} < maxi<j<jv{aj} •maxi<j<Ar{6i}. By the definition of he* and hc^. 



I he* — hc^l < max 



Vp \Gp{a) - Gp{a)\ ^ j jj 



a=i/p:l<i<a^p ■<ya(Y^^Ci) 

where / is stochastic and // is deterministic, and 

^\Gp{a)-Gp{a)\ \Gpia){l ~ Gpia) 

1 — max , _ _ 11 = max 



a=i/p:l<i<a'^p y/Gp{a){l — Gp{a)) ' a=i/p:l<i<a^p ^/ a{l — a) 

To show (A), it is sufficient to show that both under the null and the alternative, 

P(/ > Clogp) = o(l), (A.32) 

and that 



// < 1 under Hq, II <\/1 + |hc„| under h["'\ (A.33) 
Consider (A.32). Note that 



;p) < V pSVpEM 

a=i/p,l<i<aQp V P\ /\ 



Gp{a)) 



For each a, applying Bennett's inequality [Shorack and Wellner (1986) page 851] with 
= I{T^, > t^J^) - P{T% > t^J^) and A = C\ogipWGp{a)il - Gp{a)), 

dI Vp\Gpia) - Gp{a)\ \_pr^|^.^ n t w \\ 



VGp{a){l - Gp{a)} 



\2 , 2^ 
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where il^iX) — (2/A^){(l+A) log(l+A) — 1} is monotonely decreasing in A and satisfies 
X'^ip{X) ~ 2AlogA for large A, and cr^ is the average variance of Xj: 



p 



On one hand, recall that there is at least a fraction (1 — e„) of c^-'^s that are 0, 



and that when c^^^ = 0, P(tJo) > ^^^) ^ We see that 
On the other hand, by Schwartz inequality. 



(A.36) 



^ 7 = 1 ^ 7 = 1 



It follows from the definition of A that 

A 



> C log p. 



(A.37) 



RecaUing that t/j is monotonely decreasing, and that X'^'4'{X) ~ 2Alog(A) for large A, 
it follows from (A.36)-(A.37) that 

(A.38) 

where C > is a generic constant. Note that the last term in (A.38) = o(l/p). 
Combining (A.34)-(A.35) and (A.37)-(A.38) gives (A.32). 

It remains to prove (A. 33). Recall that Gp{a) — o(l). By the definition of //, 



II — max 

a=i/p,l<i<aQp 



lGp{a){l-Gp{a)) 
a{l — a) 



< 



max 



a=i/p,l<i<aQp 



Gp{a) 



a 



(A.39) 



Under the null, by Theorem 1, Gp(a)/a = 1 + 0(^y\ogp/^/n) + o(l), which gives the 
first assertion in (A. 33). For the second assertion, write 

a{l — a) 



Gp{a) ^ , Gp{a) - a ^ ^ 



a 



a 



-\- hCjj Q, 



Noting that y^a{l — a)/{y/pa) < 1, it follows that 

Gp{a) 



a 



< 1 + |hc„a|- 



(A.40) 
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Combining (A. 39) and (A.40) gives the second claim of (A. 33). 

Consider (B). In this case, the null hypothesis is true and all c^^h equal 0. By 
the definition and (A.29)-(A.31), 

|hc„,.| < ^'^^1^''' < C^a-logp-{p/n) + o{l). (A.41) 

RecaUing that a < Uq with ckq = (n/p) logp, hc„,Q < Clogp and the claim follows. 

Consider (C). In this case, the alternative hypothesis is true, and a fraction (1— e„) 
of c^-'^s is 0, with the remaining of them equal to r„. Using (2.13), where c = t„, 

P{T^U) > = (1 + 0(1)) • PiT^U) > ta) + o{l/p) = ^z^ - T„) + o(l/p), (A.42) 

where l» = 1 — $ is the survival function of a N(0, 1). Combining (A. 29) and (A.42), 

Gp{a) = (1 - en)a{l + O{y^\ogp/Vn) + e„Lp$(^« - t„) + o(l/p), 

and it follows from direct calculations that 

~ _ y/p[Gp{a) - a] 

nc„,Q — ^ ^ 

y q;(1 — a) 

^ y^jenLp^Zg - Tn) + (l - e„)Q;(l + Q(^log(p)/n)) -a + o{l/p)] 

Lpy/pen^Zg-Tn) j a ^ 

= , , VP^n\ jz r + 0{^/p\ogpa/n) + o{l). (A.43) 

y/a{l - a) \ K^-Oi) 

Recall that a < a* = np"^ log p. First, ^Jp€n\/ cil (1 — «) < e„A/p«o — ^n^/n log(p)- 
This equals Lpp^/'^~^ = o(l) because 9 < 1 and /3 > |. Second, i/plogp ■ a/n < 
^/plog{p)aQ < logp. Inserting these into (A.43) gives 

r- _ Lpy/pen^jZq - Tn) 
^^n,a — 1—7- r- + -^p) 

y q;(1 — a) 

and so 

hc„ = /// + Lp, where /// = max = — -. (A.44) 

a=i/p,l<i<alp a/q;(1 — a) 

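We now re-parametrize with $z_\alpha$ as

$$z_\alpha = \sqrt{2 q \log p} = s_n(q), \quad \text{where } q > 0, \text{ so that } \alpha = \bar\Phi\{s_n(q)\}.$$

By Mill's ratio, we have $\bar\Phi(s_n(q)) = L_p\, p^{-q}$. Recall that $1/p \le \alpha \le \alpha_0$, where
$\alpha_0 = L_p\, p^{\theta - 1}$. We deduce that the range of possible values for the parameter $q$ runs
from $(1 - \theta)$ to 1 (with lower order terms neglected). It follows from elementary
calculus that

$$III = \max_{\alpha = i/p,\; 1 \le i \le \alpha_0 p} \frac{\sqrt{p}\, \epsilon_n \bar\Phi(z_\alpha - \tau_n)}{\sqrt{\alpha (1 - \alpha)}} = L_p \cdot \max_{(1 - \theta) \le q \le 1} \frac{\sqrt{p}\, \epsilon_n \bar\Phi\{s_n(q) - \tau_n\}}{p^{-q/2}}. \qquad (A.45)$$

Moreover, by Mill's ratio,

$$\sqrt{p}\, \epsilon_n \bar\Phi\{s_n(q) - \tau_n\} = L_p \cdot p^{\pi(q; \beta, r)}, \qquad (A.46)$$

where

$$\pi(q; \beta, r) = \begin{cases} \tfrac12 - \beta, & 0 < q \le r, \\ \tfrac12 - \beta - (\sqrt{q} - \sqrt{r})^2, & r < q \le 1. \end{cases}$$

Inserting (A.46) into (A.45) gives

$$III = L_p \cdot \max_{(1 - \theta) \le q \le 1} p^{\pi(q; \beta, r) + q/2}. \qquad (A.47)$$

We now analyze $\pi(q; \beta, r) + q/2$ as a function of $q \in (0, 1]$. In region (i), $4r < (1 - \theta)$,
and $\pi(q; \beta, r) + q/2$ is monotonically decreasing in $[(1 - \theta), 1]$. Therefore, the maximizing
value of $q$ is $(1 - \theta)$, at which $\pi(q; \beta, r) + q/2 = \tfrac12 - \beta + (1 - \theta)/2 - \{\sqrt{1 - \theta} - \sqrt{r}\}^2$.
In region (ii), $(1 - \theta) < 4r < 1$. As $q$ ranges between $(1 - \theta)$ and 1, $\pi(q; \beta, r) + q/2$
first monotonically increases and reaches the maximum at $q = 4r$, then monotonically
decreases. The maximum of $\pi(q; \beta, r) + q/2$ is then $r - \beta + \tfrac12$. In region (iii), $4r > 1$,
and $\pi(q; \beta, r) + q/2$ is monotonically increasing in $[(1 - \theta), 1]$. The maximizing value of
$q$ is 1, at which $\pi(q; \beta, r) + q/2 = 1 - \beta - (1 - \sqrt{r})^2$. Combining these with (A.47)
and (A.44) gives the claim. □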
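
The case analysis above can be checked numerically. The sketch below maximises $\pi(q; \beta, r) + q/2$ over a grid of $q \in [1 - \theta, 1]$ and compares the result with the closed form for $\delta(\beta, r, \theta)$. The function names and the grid size are illustrative choices, and the piecewise form of $\pi$ is the one implied by Mill's ratio as used in this proof.

```python
import math

def pi_exponent(q, beta, r):
    """Exponent pi(q; beta, r) in (A.46)."""
    if q <= r:
        return 0.5 - beta
    return 0.5 - beta - (math.sqrt(q) - math.sqrt(r)) ** 2

def delta_closed_form(beta, r, theta):
    """Closed-form maximum, region by region, as in Theorem 3(b)."""
    if 4.0 * r < 1.0 - theta:   # region (i): maximum at q = 1 - theta
        return 0.5 - beta + (1.0 - theta) / 2.0 - (math.sqrt(1.0 - theta) - math.sqrt(r)) ** 2
    if 4.0 * r < 1.0:           # region (ii): maximum at q = 4r
        return r - beta + 0.5
    return 1.0 - beta - (1.0 - math.sqrt(r)) ** 2   # region (iii): maximum at q = 1

def delta_by_grid(beta, r, theta, steps=20000):
    """Brute-force maximum of pi(q; beta, r) + q/2 over q in [1 - theta, 1]."""
    lo = 1.0 - theta
    best = -math.inf
    for k in range(steps + 1):
        q = lo + (1.0 - lo) * k / steps
        best = max(best, pi_exponent(q, beta, r) + q / 2.0)
    return best
```

Agreement of the two functions at points taken from each of the three regions gives a quick sanity check on the monotonicity claims.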
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